Comparative Evaluation of Arabic Language Morphological Analysers and Stemmers

Authors

  • Majdi Sawalha
  • Eric Atwell
Abstract

Arabic morphological analysers and stemming algorithms have become a popular area of research. Many computational linguists have designed and developed algorithms to solve the problem of morphology and stemming. Each researcher proposed their own gold standard, testing methodology and accuracy measurements to evaluate their own algorithm, so the reported results cannot be compared directly. In this paper we have accomplished two tasks. First, we proposed four different fair and precise accuracy measurements and two 1000-word gold standards taken from the Holy Qur’an and from the Corpus of Contemporary Arabic. Second, we combined the results from the morphological analysers and stemming algorithms by voting after running them on the sample documents. The evaluation of the algorithms shows that Arabic morphology is still a challenge.

© 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-ncsa/3.0/). Some rights reserved.

1 Three Stemming Algorithms

We selected three stemming algorithms for which we had ready access to the implementation and/or results.

Shereen Khoja Stemmer: We obtained a Java version of Shereen Khoja’s stemmer (Khoja, 1999). Khoja’s stemmer removes the longest suffix and the longest prefix. It then matches the remaining word against verbal and noun patterns to extract the root. The stemmer makes use of several linguistic data files, such as a list of all diacritic characters, punctuation characters, definite articles, and 168 stop words (Larkey & Connell 2001).

Tim Buckwalter Morphological analyzer: Tim Buckwalter developed a morphological analyzer for Arabic. Buckwalter compiled a single lexicon of all prefixes and a corresponding unified lexicon for suffixes instead of compiling numerous lexicons of prefix and suffix morphemes. He included short vowels and diacritics in the lexicons (see the Tim Buckwalter web site: http://www.qamus.org).

Tri-literal Root Extraction Algorithm: Al-Shalabi, Kanaan and Al-Serhan developed a root extraction algorithm which does not use any dictionary. It assigns a weight to each letter of a word and multiplies that weight by the letter’s position. Consonants were assigned a weight of zero, and different weights were assigned to the letters grouped in the word “سألتمونيها”, since all affixes are formed by combinations of these letters. The algorithm selects the letters with the lowest weights as root letters (Al-Shalabi et al, 2003).

2 Our Approach: Reuse Others’ Work

The reuse of existing components is an established principle in software engineering. We procured results from several candidate systems, and then developed a program to allow “voting” on the analysis of each word: for each word, examine the set of candidate analyses. Where all systems are in agreement, the common analysis is copied; where contributing systems disagree on the analysis, take the “majority vote”, that is, the analysis given by most systems. If there is a tie, take the result produced by the system with the highest accuracy (Atwell & Roberts, 2007).

3 Experiments and Results

Experiments were done by executing the three stemming algorithms discussed above on two test documents. The first is a randomly selected chapter of the Qur’an, chapter 29, “Souraht Al-Ankaboot” (“The Spider” in English), see figure 1. The second is a newspaper text taken from the Corpus of Contemporary Arabic developed at the University of Leeds, UK. We selected this test document from the politics, sports and economics section, taken from newspaper articles, see figure 2 (Al-Sulaiti & Atwell, 2006). Each test document contains about 1000 words. We manually extracted the roots of the test documents’ words to compare results from different stemming systems. The extracted roots were checked by Arabic language scholars.

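The voting scheme described in Section 2 can be sketched in a few lines. This is only an illustrative sketch, not the authors' program: the function name `vote`, the system names, the transliterated placeholder roots and the accuracy ranking are all assumptions introduced here, and the ranking is assumed to be known in advance.

```python
from collections import Counter


def vote(analyses, accuracy_rank):
    """Pick one analysis for a single word by majority vote.

    analyses      -- dict mapping a system name to the root it proposed
    accuracy_rank -- list of system names, most accurate first
                     (hypothetical; assumed to be known beforehand)
    """
    counts = Counter(analyses.values())
    best = max(counts.values())
    # All analyses that received the highest number of votes.
    tied = {root for root, n in counts.items() if n == best}
    if len(tied) == 1:
        # Full agreement or a clear majority.
        return next(iter(tied))
    # Tie: fall back to the most accurate system whose answer is tied.
    for system in accuracy_rank:
        if analyses.get(system) in tied:
            return analyses[system]
    raise ValueError("accuracy_rank names none of the voting systems")


# Hypothetical example: two of three systems agree on the root "ktb".
rank = ["buckwalter", "khoja", "triliteral"]
print(vote({"khoja": "ktb", "buckwalter": "ktb", "triliteral": "kt"}, rank))
```

With three systems, a tie can only occur when all three disagree, in which case the sketch returns the answer of the highest-ranked system.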
Table 1 shows a detailed analysis of the sample test documents, the Qur’an corpus as one unit, and a test document of contemporary Arabic taken from Al-Rai, a daily newspaper published in Jordan. The analysis also shows that function words such as “في” “fi” “in”, “من” “min” “from”, “على” “Ala” “on” and “الله” “Allah” “GOD” are the most frequent words in any Arabic text. On the other hand, non-function words with high frequency, such as “الجامعات” “Al-Jami’at” “Universities” and “الكويت” “Al-Kuwait” “Kuwait”, give a general idea about the main topic of the article. Simple tokenization is applied to the text of the gold standard documents. This ensures that the test documents can be used to test any stemming algorithm smoothly and correctly.

Figure 1: Sample from Gold Standard first document taken from Chapter 29 of the Qur’an.

4 Four Accuracy measurements

In order to fairly compare different stemming algorithms we applied four different accuracy measurements.

Figure 2: Sample from Gold Standard document taken from the Corpus of Contemporary Arabic.

Table 1: Summary of detailed analysis.

  • Qur’an Corpus: 77,789 tokens; 19,278 word types
  • Gold Standard First Document (Chapter 29 of the Qur’an): 987 tokens; 616 word types
  • Gold Standard Second Document (“Corpus of Contemporary Arabic”): 1005 tokens; 710 word types
  • Al-Rai daily Newspaper Test Document: 977 tokens; 678 word types

The remaining rows of the table list the top 10 tokens and their frequencies for each of the four texts.
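The kind of summary reported in Table 1 (total tokens, word types, most frequent tokens) can be computed as sketched below. This is a hedged illustration only: the function name `summarise` is hypothetical, and plain whitespace splitting is an assumption standing in for the paper's "simple tokenization", whose exact rules are not given.

```python
from collections import Counter


def summarise(text, top_n=10):
    """Count tokens, word types and the most frequent tokens in a text.

    Tokenization is plain whitespace splitting (an assumption; the
    paper does not spell out its tokenization rules).
    """
    tokens = text.split()
    freq = Counter(tokens)
    return {
        "tokens": len(tokens),            # total number of tokens
        "types": len(freq),               # number of distinct tokens
        "top": freq.most_common(top_n),   # (token, frequency) pairs
    }


# Hypothetical transliterated sample, used only to show the output shape.
stats = summarise("fi al-bayt min fi al-bayt fi")
print(stats["tokens"], stats["types"], stats["top"][0])
```

On a real test document, the `top` list would be expected to be dominated by function words, as the analysis above notes.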

Similar resources

Developing Morphological Analysers for South Asian Languages: Experimenting with the Hindi and Gujarati Languages

A considerable amount of work has been put into development of stemmers and morphological analysers. The majority of these approaches use hand-crafted suffix-replacement rules but a few try to discover such rules from corpora. While most of the approaches remove or replace suffixes, there are examples of derivational stemmers which are based on prefixes as well. In this paper we present a rule-...

Enhancing Root Extractors Using Light Stemmers

The rise of Natural Language Processing (NLP) opened new possibilities for various applications that were not applicable before. A morphological-rich language such as Arabic introduces a set of features, such as roots, that would assist the progress of NLP. Many tools were developed to capture the process of root extraction (stemming). Stemmers have improved many NLP tasks without explicit know...

Light Stemming for Arabic Information Retrieval

Computational Morphology is an urgent problem for Arabic Natural Language Processing, because Arabic is a highly inflected language. We have found, however, that a full solution to this problem is not required for effective information retrieval. Light stemming allows remarkably good information retrieval without providing correct morphological analyses. We developed several light stemmers for ...

Comparative Study of Various Persian Stemmers in the Field of Information Retrieval

In linguistics, stemming is the operation of reducing words to their more general form, which is called the ‘stem’. Stemming is an important step in information retrieval systems, natural language processing, and text mining. Information retrieval systems are evaluated by metrics like precision and recall and the fundamental superiority of an information retrieval system over another one is mea...

Publication date: 2008